
Darwin improvements: indexing scaling, retention, analytics dashboard, tests #39

Open
rajivml wants to merge 9 commits into feature/darwin from feature/darwinimprovements

Conversation


rajivml commented May 2, 2026

Summary

Multi-area enhancement PR landing the indexing-scaling, DB-retention, and admin-analytics work along with the supporting test infrastructure and connector / UI improvements built up across the Darwin development cycle.

  • Indexing scaling (NUM_INDEXING_WORKERS > 1 is now safe): per-cc-pair Postgres advisory lock + scheduler-side per-DocumentSource concurrency cap + per-attempt indexing priority.
  • DB retention (new daily Celery beat at 08:00 UTC): six policies (kombu_message, task_queue_jobs, index_attempt opt-in with keep-last-N, permission_sync_run, usage_reports, chat) with batched deletes, advisory locks, ANALYZE after large purges, and FK-safe chat retention covering search_doc orphans + LO blobs.
  • Analytics dashboard at /admin/analytics (Tremor): KPI tiles + AreaCharts + BarList, strict NPS calculation, Day/Month granularity toggle, daily rollup table populated 30 minutes before the retention sweep so historical analytics survive chat data deletion.
  • Slackbot resolved-button now records chat_feedback rows (powers the strict-NPS feedback signal).
  • Indexing-status admin page UX: status filter, name search, bulk pause/re-enable, pagination at 10/page.
  • Connector improvements: split Salesforce admin into sf-account / sf-kbarticles, new GitHub-Files connector, Slack channel-ID support, in-place credential edit forms.
  • DB migrations (4): chat UI perf indexes (CREATE INDEX CONCURRENTLY for online deploy), indexing-status perf indexes, index_attempt.indexing_priority column, analytics_daily_rollup table.
  • Test infrastructure: 4 new orchestrator scripts (analytics e2e, features e2e, Celery smoke, seed factory) under backend/scripts/, all using tag-isolated dummy data.
  • Documentation: AGENTS.md, CLAUDE.md, TESTING.md (new) + CONTRIBUTING.md updates covering the new env vars, scaling, retention windows, and stress-test profiles.
  • K8s: documented new env vars in deployment/kubernetes/env-configmap.yaml (all blank = defaults).

What changed (by area)

Backend

  • backend/danswer/db/retention.py (new) — six retention policies under one daily Celery beat task. Advisory lock + batched per-policy DELETEs. Rollback-before-unlock pattern in the finally so a failed-transaction state can't strand the lock on a pooled connection (a minimal sketch of this pattern follows this list).
  • backend/danswer/db/analytics.py + backend/danswer/db/analytics_rollup.py + backend/danswer/server/analytics/api.py (new) — community-side analytics module parallel to the EE one. Endpoints under /api/analytics/admin/.... Rollup table populated by a checkpoint-driven daily task.
  • backend/danswer/configs/indexing_concurrency.py (new) + backend/danswer/background/update.py — scheduler-side per-DocumentSource cap. Over-cap NOT_STARTED attempts stay scheduled; no spurious FAILED rows.
  • backend/danswer/db/index_attempt.py + backend/danswer/background/indexing/run_indexing.py — per-cc-pair advisory lock prevents same-cc-pair concurrent runs.
  • backend/danswer/danswerbot/slack/handlers/handle_buttons.py + backend/danswer/danswerbot/slack/blocks.py — resolved button now records chat_feedback with predefined_feedback='resolved'.
  • backend/danswer/background/celery/celery_app.py — registers run_analytics_rollup_task (07:30 UTC) + run_retention_policies_task (08:00 UTC) beat entries.
  • backend/alembic/versions/9d02a9a5ce39 — task_queue_jobs / index_attempt indexes for the indexing-status page.
  • backend/alembic/versions/fd307e9ecc9b — adds index_attempt.indexing_priority + supporting index.
  • backend/alembic/versions/b5d3f1a9e7c2 — chat_message(chat_session_id) and chat_session(user_id) btree indexes via CREATE INDEX CONCURRENTLY.
  • backend/alembic/versions/c8a4e2f9d1b3 — analytics_daily_rollup table.
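
The retention module's lock handling is worth a closer look: the sweep runs under a session-level Postgres advisory lock, and the finally block rolls back before unlocking, because every statement fails inside an aborted transaction, which is exactly how a lock gets stranded on a pooled connection. A minimal sketch of the pattern, assuming SQLAlchemy sessions; the lock key, table, and column names here are illustrative, not the real retention.py code:

```python
from datetime import datetime, timedelta, timezone

from sqlalchemy import text
from sqlalchemy.orm import Session

RETENTION_LOCK_KEY = 815_001  # illustrative constant, not the real key


def run_retention_sweep(db: Session, days: int, batch_size: int = 5000) -> None:
    if not db.execute(
        text("SELECT pg_try_advisory_lock(:k)"), {"k": RETENTION_LOCK_KEY}
    ).scalar():
        return  # another worker already holds the sweep lock

    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    try:
        while True:
            # Batched deletes keep each transaction (and its row locks) small.
            deleted = db.execute(
                text(
                    "DELETE FROM task_queue_jobs WHERE id IN ("
                    " SELECT id FROM task_queue_jobs"
                    " WHERE register_time < :cutoff LIMIT :n)"
                ),
                {"cutoff": cutoff, "n": batch_size},
            ).rowcount
            db.commit()
            if deleted < batch_size:
                break
        db.execute(text("ANALYZE task_queue_jobs"))
        db.commit()
    finally:
        # Rollback first: in a failed-transaction state the unlock statement
        # itself would error, leaving the session-level lock held by whatever
        # pooled connection this session was using.
        db.rollback()
        db.execute(text("SELECT pg_advisory_unlock(:k)"), {"k": RETENTION_LOCK_KEY})
        db.commit()
```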

Frontend

  • web/src/app/admin/analytics/page.tsx (new) — full analytics dashboard with date range picker, KPIs, charts, BarList.
  • web/src/app/admin/indexing/status/CCPairIndexingStatusTable.tsx — status filter dropdown, name search, bulk action row, pagination=10, "Clear filters" button.
  • web/src/components/admin/Layout.tsx — sidebar entry for the analytics page.
  • web/src/app/admin/connectors/sf-account/page.tsx + sf-kbarticles/page.tsx — split from monolithic Salesforce page.
  • web/src/app/admin/connectors/github-files/page.tsx (new) — GitHub-Files setup UI.
  • web/src/app/admin/connector/[ccPairId]/CredentialSection.tsx (new) — generic in-place credential edit.

Infra / docs

  • deployment/kubernetes/env-configmap.yaml — adds documented blank entries for INDEXING_PER_SOURCE_CAP, RETENTION_DAYS_*, ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS.
  • AGENTS.md, CLAUDE.md, TESTING.md (new); CONTRIBUTING.md substantially expanded.
  • .gitignore — Claude Code /export outputs + stray requestdata.json debug payloads.

What's tested

Positive scenarios

Run end-to-end via the new orchestrator scripts under backend/scripts/:

  • ✅ Per-attempt priority — get_not_started_index_attempts returns rows in priority DESC, time_created ASC order; update_index_attempt_priority clamps to ceiling=100 and refuses on IN_PROGRESS rows. (test_features_e2e.py Phase 1)
  • ✅ index_attempt retention — with RETENTION_DAYS_INDEX_ATTEMPT=60 and KEEP_LAST_N=20, 25 SUCCESS rows aged 70-94d collapse to the 20 most recent; the 5 oldest are deleted. Also serves as a regression check on the column's case-sensitive uppercase status storage. (test_features_e2e.py Phase 2)
  • ✅ permission_sync_run retention preserves in_progress — 5 terminal (success/failed) rows aged 90d are deleted, 3 in_progress rows of the same age survive. (test_features_e2e.py Phase 3)
  • ✅ Resolved-button feedback DB write — create_chat_message_feedback(predefined_feedback='resolved', ...) writes exactly one row with the right shape against a slackbot-style session. (test_features_e2e.py Phase 4)
  • ✅ Analytics rollup populated correctly — 60 days of seeded chats produce ≥60 rows in analytics_daily_rollup with non-zero sums; checkpoint advances to today. (test_analytics_e2e.py Phase 2-3)
  • ✅ Strict NPS computation — derived from likes + resolved vs dislikes + needs_help, returns sensible values across seeded distributions; one plausible formula is sketched after this list. (test_analytics_e2e.py Phase 3)
  • ✅ Chat retention deletes old + leaves fresh — old chats (>30d) fully removed, fresh chats untouched, orphan search_docs swept, file LO blobs cleaned up. (test_analytics_e2e.py Phase 5)
  • ✅ Rollup data survives retention — analytics page still has 60+ days of data even after the chat retention sweep deletes the underlying chat_session/chat_message rows. (test_analytics_e2e.py Phase 6)
  • ✅ Rollup checkpoint idempotency — re-running the rollup task is a no-op for already-current dates. (test_analytics_e2e.py Phase 7)
  • ✅ Celery broker → worker plumbing — both run_analytics_rollup_task.delay() and run_retention_policies_task.delay() complete in <2s with state=SUCCESS and apply visible side effects. (test_celery_jobs_smoke.py)
  • ✅ Cleanup is tag-isolated — every test script removes only its own __test_*__ tagged rows; no risk to real data.
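
The strict-NPS formula is not spelled out in this description beyond "likes + resolved vs dislikes + needs_help". One plausible reading, assuming a promoters-minus-detractors percentage and an undefined result when there is no feedback at all (matching the edge case in the next list):

```python
from typing import Optional


def strict_nps(likes: int, resolved: int, dislikes: int, needs_help: int) -> Optional[float]:
    """Hypothetical strict NPS: positive signals count as promoters, negative
    signals as detractors; None when there is no feedback at all."""
    promoters = likes + resolved
    detractors = dislikes + needs_help
    total = promoters + detractors
    if total == 0:
        return None  # the dashboard should render a placeholder, not NaN
    return 100.0 * (promoters - detractors) / total
```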

Negative / edge / failure scenarios

  • ✅ Concurrent retention run is correctly skipped — orphan retention advisory lock injected from a side process; the .delay()'d retention task succeeds quickly but skips its work; 5 seeded old chats remain in place (visible side-effect). (test_celery_jobs_smoke.py resilience loop)
  • ✅ Recovery from orphan lock — pg_terminate_backend(pid) on the lock-holder restores normal operation; subsequent retention run deletes everything it should.
  • ✅ Failed-transaction state in retention finally no longer strands the lock — verified by direct in-process call to run_retention_policies() after the rollback-before-unlock fix; no orphan lock left behind.
  • ✅ Boundary precision on chat retention — fresh chats near the cutoff don't flicker between buckets due to now() drift between the before/after queries (a frozen 28-day boundary in the orchestrator's Phase 5 keeps the assertion stable).
  • ✅ Per-source cap defers extra attempts — N GitHub cc-pairs with INDEXING_PER_SOURCE_CAP=1 produce one IN_PROGRESS at a time; the rest stay NOT_STARTED and get reconsidered each tick (no FAILED rows produced by capping).
  • ✅ Empty data scenarios — analytics endpoints return empty arrays when no data in range; retention runs are no-ops when there's nothing to delete.
  • ✅ Status casing regression check — the index_attempt status filter must use uppercase 'SUCCESS'/'FAILED' (the column stores enum NAMES, not .value); a lowercase regression silently no-ops the policy. Caught + verified.
  • ✅ NPS undefined when no feedback — the UI shows a placeholder rather than NaN or an error.
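
The status-casing check above exists because SQLAlchemy persists the NAME of a Python enum member ('SUCCESS'), not its value ('success'), when a column is declared with Enum(..., native_enum=False). A simplified illustration of why a lowercase filter silently matches nothing; the model and enum are cut down for the example:

```python
import enum

from sqlalchemy import Column, Enum, Integer, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class IndexingStatus(enum.Enum):
    SUCCESS = "success"  # .name == "SUCCESS", .value == "success"
    FAILED = "failed"


class IndexAttempt(Base):
    __tablename__ = "index_attempt"
    id = Column(Integer, primary_key=True)
    # native_enum=False means a plain varchar column holding the enum NAME.
    status = Column(Enum(IndexingStatus, native_enum=False))


# Correct: comparing against the member renders the stored NAME ("SUCCESS").
good = select(IndexAttempt).where(IndexAttempt.status == IndexingStatus.SUCCESS)

# Silent no-op if written as raw SQL: the column never contains "success".
bad_sql = "DELETE FROM index_attempt WHERE status = 'success'"
```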

Manual UI smoke (per TESTING.md)

Verify after python scripts/test_analytics_e2e.py --yes --keep-data:

  • /admin/analytics renders all KPIs + 3 charts with non-zero numbers.
  • Date range picker (Last 7d, 30d, 90d, custom) updates charts.
  • Day / Month granularity toggle reshapes the charts; subtitles update.
  • /admin/indexing/status: source filter, status filter, name search, bulk pause/re-enable, pagination.
  • No console errors.

Deployment notes

  1. alembic upgrade head — applies all four new migrations (chat UI perf indexes use CREATE INDEX CONCURRENTLY and won't block writes).
  2. python scripts/backfill_analytics_rollup.py — one-time, populates analytics_daily_rollup from existing chat data BEFORE the next retention sweep deletes anything. Skipping this leaves the dashboard at zero for historical ranges until the daily task accumulates enough days.
  3. Restart background-deployment so Celery beat picks up the new schedule entries (run-analytics-rollup and run-retention).

The new env vars (all defaulted sensibly):

  • INDEXING_PER_SOURCE_CAP (default 1)
  • RETENTION_DAYS_KOMBU (7) / RETENTION_DAYS_TASK_QUEUE (30) / RETENTION_DAYS_INDEX_ATTEMPT (0=disabled) / RETENTION_KEEP_LAST_N_INDEX_ATTEMPTS (20) / RETENTION_DAYS_CHAT (30) / RETENTION_DAYS_PERMISSION_SYNC (30) / RETENTION_DAYS_USAGE_REPORTS (90) / RETENTION_BATCH_SIZE (5000) / RETENTION_MAX_BATCHES (200)
  • ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS (default 2; MUST be < RETENTION_DAYS_CHAT)
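
A sketch of how the defaults above could be read in one place (variable names and defaults are taken from this list; the helper and module layout are illustrative, not the actual configs code):

```python
import os


def _int_env(name: str, default: int) -> int:
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else default  # blank in the configmap = code default


INDEXING_PER_SOURCE_CAP = _int_env("INDEXING_PER_SOURCE_CAP", 1)
RETENTION_DAYS_KOMBU = _int_env("RETENTION_DAYS_KOMBU", 7)
RETENTION_DAYS_TASK_QUEUE = _int_env("RETENTION_DAYS_TASK_QUEUE", 30)
RETENTION_DAYS_INDEX_ATTEMPT = _int_env("RETENTION_DAYS_INDEX_ATTEMPT", 0)  # 0 = disabled
RETENTION_KEEP_LAST_N_INDEX_ATTEMPTS = _int_env("RETENTION_KEEP_LAST_N_INDEX_ATTEMPTS", 20)
RETENTION_DAYS_CHAT = _int_env("RETENTION_DAYS_CHAT", 30)
RETENTION_DAYS_PERMISSION_SYNC = _int_env("RETENTION_DAYS_PERMISSION_SYNC", 30)
RETENTION_DAYS_USAGE_REPORTS = _int_env("RETENTION_DAYS_USAGE_REPORTS", 90)
RETENTION_BATCH_SIZE = _int_env("RETENTION_BATCH_SIZE", 5000)
RETENTION_MAX_BATCHES = _int_env("RETENTION_MAX_BATCHES", 200)
ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS = _int_env("ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS", 2)

# Illustrative guard for the constraint called out above.
assert ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS < RETENTION_DAYS_CHAT
```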

Known follow-ups (not in this PR)

  • PermissionSyncRun.update_type and .status columns lack native_enum=False in the model; SQLAlchemy 2.x bulk-insert paths fail with a ::permissionsyncjobtype cast error against the varchar columns. Single-row ORM inserts may work; bulk inserts don't. Workaround already in place via raw SQL where needed (seed factory). Worth a separate model-fix PR — no migration needed (column type is already varchar).
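
For reference, the eventual model fix would look roughly like this; the enum members are approximations and the real class lives in the existing models module:

```python
import enum

from sqlalchemy import Column, Enum, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class PermissionSyncJobType(enum.Enum):  # member names approximate
    USER_LEVEL = "user_level"
    GROUP_LEVEL = "group_level"


class PermissionSyncRun(Base):
    __tablename__ = "permission_sync_run"
    id = Column(Integer, primary_key=True)
    # native_enum=False keeps the column a plain varchar and stops the
    # bulk-insert path from emitting a ::permissionsyncjobtype cast.
    update_type = Column(Enum(PermissionSyncJobType, native_enum=False))
```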

User-facing admin capabilities (call-out)

A few of the new behaviours that affect day-to-day operator workflows, called out separately from the deeper technical changes above:

  • Bump priority of a queued indexing attempt — without code changes or restarts. A manual Re-Index from the cc-pair page can now be given a higher priority so it jumps ahead of auto-scheduled work in the Dask queue. Implementation:

    • Backend: PATCH /admin/index-attempt/{index_attempt_id}/priority (backend/danswer/server/documents/connector.py:646) calls update_index_attempt_priority (clamps to 0–100; refuses on rows that have already moved out of NOT_STARTED); see the sketch after these capability notes.
    • Frontend: web/src/app/admin/connector/[ccPairId]/IndexingAttemptsTable.tsx shows a priority column + bump control; ReIndexButton.tsx lets you set the priority at trigger time.
    • Schema: new index_attempt.indexing_priority integer column + (status, indexing_priority, time_created) index (Alembic fd307e9ecc9b).
  • Edit credentials in place — no need to delete and recreate the connector. The new web/src/app/admin/connector/[ccPairId]/CredentialSection.tsx component renders a per-cc-pair credential edit form with the field-set inferred from the connector source. Integrated across all touched connector pages: sf-account, sf-kbarticles, github, github-files, confluence, jira, slack, sharepoint. Backed by the existing PUT /admin/credential/... API, exposed through web/src/lib/credential.ts.

  • Indexing-status page filters and bulk actions (already mentioned above, summarised here for completeness): source-type dropdown + status dropdown + name search + bulk pause/re-enable buttons + 10/page pagination + "Clear filters" button. All filter state resets pagination so bulk actions never operate on hidden rows.

  • Analytics dashboard at /admin/analytics (already covered) — first community-edition analytics page in this fork; all six existing-vs-new endpoints feed it.
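
A hedged sketch of the clamp-and-refuse behaviour behind the priority endpoint described in the first bullet above; the real update_index_attempt_priority lives in backend/danswer/db/index_attempt.py, and the exact signature and error handling here are assumptions:

```python
from fastapi import HTTPException
from sqlalchemy.orm import Session

from danswer.db.models import IndexAttempt, IndexingStatus  # existing models

MAX_INDEXING_PRIORITY = 100  # ceiling mentioned in the test notes


def update_index_attempt_priority(db: Session, index_attempt_id: int, priority: int) -> None:
    attempt = db.get(IndexAttempt, index_attempt_id)
    if attempt is None:
        raise HTTPException(status_code=404, detail="No such index attempt")
    if attempt.status != IndexingStatus.NOT_STARTED:
        # Bumping a running or finished attempt is meaningless, so refuse.
        raise HTTPException(status_code=400, detail="Attempt has already started")
    attempt.indexing_priority = max(0, min(priority, MAX_INDEXING_PRIORITY))
    db.commit()
```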

Concurrency hardening follow-up (commit e31f2104)

Three independent concurrency bugs surfaced while validating the indexing-scaling work; all fixed in a single follow-up commit on the same branch.

1. Per-source cap leak (the "two slack runs at once + lower-priority wins" bug)

Symptom: with INDEXING_PER_SOURCE_CAP=1 and 7 Slack cc-pairs (one priority-bumped to 20), users observed two Slack runs going simultaneously, and the priority-bumped attempt sometimes lost the worker race to a priority-0 attempt.

Root cause: kickoff_indexing_jobs built running_per_source purely from DB rows with status=IN_PROGRESS. Attempts the scheduler had already submitted to Dask but which were still NOT_STARTED in the DB (the queue / worker-spinning-up window — typically a few seconds) were invisible to the cap. A subsequent tick saw 0 slack IN_PROGRESS, fell through to the next slack candidate, and submitted it. Both were now in Dask's queue; whichever the worker pulled first won, regardless of priority.

Fix (backend/danswer/background/update.py):

  • Extracted _build_running_view and _evaluate_dispatch_for_attempt pure helpers (also makes the logic unit-testable without mocking SQLAlchemy).
  • The view now folds existing_jobs (filtered to non-terminal IndexAttempt rows via status.notin_([SUCCESS, FAILED])) into running_per_source AND in_progress_cc_pair_keys, alongside the DB IN_PROGRESS query. accounted_attempt_ids dedups so an attempt that's in both lists isn't counted twice.
  • Added a scheduler-side per-cc-pair collision guard (catches manual Re-Index colliding with auto-scheduled run before it reaches the worker).
  • Lock-contention path in the worker (backend/danswer/background/indexing/run_indexing.py) now reverts to NOT_STARTED instead of writing FAILED rows; rollback-before-unlock pattern in finally to handle aborted-transaction unlock failures.
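
The heart of the fix above is that the running view now counts attempts the scheduler has already handed to Dask, not just rows the DB says are IN_PROGRESS. A simplified model of the folding logic, using the helper and variable names quoted in the fix list; the attempt objects are duck-typed stand-ins for the real IndexAttempt rows:

```python
from collections import defaultdict

TERMINAL_STATUSES = {"SUCCESS", "FAILED"}


def build_running_view(db_in_progress_attempts, existing_jobs):
    """Fold the DB view and the scheduler's own submitted-but-not-yet-started
    jobs into one picture, deduping attempts that appear in both lists."""
    running_per_source: dict[str, int] = defaultdict(int)
    in_progress_cc_pair_keys: set[tuple[int, int]] = set()
    accounted_attempt_ids: set[int] = set()

    def account(attempt) -> None:
        if attempt.id in accounted_attempt_ids:
            return  # already counted via the other list
        accounted_attempt_ids.add(attempt.id)
        running_per_source[attempt.source] += 1
        in_progress_cc_pair_keys.add((attempt.connector_id, attempt.credential_id))

    for attempt in db_in_progress_attempts:
        account(attempt)
    for attempt in existing_jobs:
        # Dask may hold attempts that are still NOT_STARTED in the DB; only
        # terminal rows are allowed to free up their source slot.
        if attempt.status not in TERMINAL_STATUSES:
            account(attempt)

    return running_per_source, in_progress_cc_pair_keys
```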

2. Connector-deletion lock storm (API + worker)

Symptom: 6 cleanup_connector_credential_pair_task invocations all failing within a 35 ms window with Failed to acquire locks after 10 attempts for documents: [...140 doc IDs...]. Each had spent the full _NUM_LOCK_ATTEMPTS × _LOCK_RETRY_DELAY = 5 minutes retrying SELECT ... FOR UPDATE NOWAIT over the same docs.

Root cause: /admin/deletion-attempt (backend/danswer/server/manage/administrative.py) had no in-flight dedup. Six clicks of "Delete connector" (or any retry loop) all called apply_async and queued parallel cleanup tasks. They raced over the same documents, retried, all timed out together.

Fix (3 layers, defense in depth):

  1. API-side dedup: /admin/deletion-attempt now returns HTTP 409 if a cleanup task for the same cc-pair is already live in task_queue_jobs (via get_latest_task + check_task_is_live_and_not_timed_out). Most user-friendly path.
  2. Worker-side advisory lock: cleanup_connector_credential_pair_task acquires a per-cc-pair advisory lock at entry (new helpers try_acquire_deletion_lock / release_deletion_lock in backend/danswer/db/connector_credential_pair.py, namespace b"DELE" — distinct from the indexing b"INDX" namespace). On contention: log + return 0. Released via rollback-before-unlock in finally. This is the safety net for any caller that bypasses the API endpoint.
  3. Existing row-level locks (prepare_to_modify_documents) are unchanged — they still serve their original purpose of preventing concurrent indexer / deletion writes to the same documents.
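
The worker-side lock in layer 2 derives its key from the cc-pair plus a namespace byte string, so deletion locks can never collide with the indexing locks. A hedged sketch assuming a hash-based key fold; the real try_acquire_deletion_lock / release_deletion_lock in backend/danswer/db/connector_credential_pair.py may derive the key differently:

```python
import hashlib

from sqlalchemy import text
from sqlalchemy.orm import Session

_DELETION_NAMESPACE = b"DELE"  # indexing uses b"INDX", so key spaces never overlap


def _deletion_lock_key(connector_id: int, credential_id: int) -> int:
    digest = hashlib.sha256(
        _DELETION_NAMESPACE + f"{connector_id}:{credential_id}".encode()
    ).digest()
    # Fold into a signed 64-bit value, the argument type pg advisory locks take.
    return int.from_bytes(digest[:8], "big", signed=True)


def try_acquire_deletion_lock(db: Session, connector_id: int, credential_id: int) -> bool:
    key = _deletion_lock_key(connector_id, credential_id)
    return bool(db.execute(text("SELECT pg_try_advisory_lock(:k)"), {"k": key}).scalar())


def release_deletion_lock(db: Session, connector_id: int, credential_id: int) -> None:
    db.rollback()  # rollback-before-unlock: never unlock inside an aborted transaction
    key = _deletion_lock_key(connector_id, credential_id)
    db.execute(text("SELECT pg_advisory_unlock(:k)"), {"k": key})
    db.commit()
```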

3. UI: Refresh button on connectors-status page

web/src/app/admin/indexing/status/page.tsx — added a Refresh button (icon: FiRefreshCw) above the connectors table that calls SWR mutate() to force an immediate refetch. Loading state driven by isValidating. The existing 10 s background poll is unchanged; this is a manual nudge between ticks for users who don't want to wait.

What's tested (follow-up commit)

Unit tests — backend/tests/unit/

  • danswer/background/test_indexing_scheduler.py — 16 tests:
    • 11 targeted dispatch-decision scenarios (priority sort, cap=0/1/2, cc-pair guard, the cap-leak fix, no double-counting).
    • 5 randomized fuzz / stress tests, each running 200 deterministic seeded iterations: _SimState simulator that models Dask flip / finish / crash probabilities, asserting per-source cap and cc-pair invariants every tick.
    • Includes test_buggy_view_without_fix_leaks_cap_regression_guard — explicitly simulates the pre-fix codepath to prove the harness is sensitive enough to detect a regression.
  • danswer/db/test_deletion_lock_keys.py — 6 tests for the deletion advisory-lock key derivation: namespace distinct from indexing lock, deterministic, no small-space collisions, within Postgres bigint range.

22 unit tests total; ~5 s runtime.
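
For readers unfamiliar with the pattern, the fuzz tests stay deterministic by seeding a fresh RNG per iteration. A stripped-down illustration; _SimState's actual API, probabilities, and invariant checks are richer than shown here, and SimState below is a hypothetical stand-in:

```python
import random


def test_scheduler_invariants_fuzz() -> None:
    for seed in range(200):  # 200 deterministic, individually reproducible runs
        rng = random.Random(seed)
        sim = SimState(rng=rng, per_source_cap=1)  # hypothetical simulator facade
        for _tick in range(50):
            sim.random_event()        # e.g. flip to IN_PROGRESS, finish, or crash
            sim.run_scheduler_tick()  # drives the real dispatch helpers
            # Invariants asserted every tick; the seed in the message makes any
            # failure trivially reproducible.
            assert sim.max_running_per_source() <= 1, f"cap leaked (seed={seed})"
            assert not sim.duplicate_cc_pair_runs(), f"cc-pair collision (seed={seed})"
```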

E2E tests — backend/scripts/

  • test_scheduler_e2e.py — 9 phases against live Postgres with a recording fake Dask client:
    • A: priority order under cap=1 (mirrors user's exact scenario).
    • B: cap-leak regression (dispatched-but-pre-completion attempt holds its source slot).
    • C: per-cc-pair guard with a real IN_PROGRESS row (Re-Index collision).
    • D: different sources have independent caps.
    • E: same-tick same-cc-pair double-submit prevented.
    • G: completed (SUCCESS) existing_jobs entries don't consume cap slots — exercises the status.notin_(...) SQL filter against live ORM.
    • H: same for FAILED.
    • I: IN_PROGRESS attempts in BOTH the DB query and existing_jobs aren't double-counted.
    • F: 20-iteration randomized soak with full invariant checks.
  • test_deletion_lock_e2e.py — 6 phases against live Postgres:
    • A: lock primitive across two SQLAlchemy sessions.
    • B: held lock causes task body to early-return 0 in milliseconds (vs. pre-fix 5-min retries).
    • C: lock released after task succeeds (clean delete).
    • D: lock released after task body raises (rollback-before-unlock).
    • E: 6 threads racing the same cc-pair → all 6 complete in ~0.02 s; mix of "1 acquired (raised)" + "5 skipped (returned 0)". Pre-fix would be ≥5 min × 6.
    • F: API dedup logic — task_queue_jobs STARTED+recent → 409, SUCCESS → allow, ancient STARTED → allow (timed out).

Both e2e suites: pass against live Postgres in ~5 s combined. Each script auto-cleans tagged data before and after; refuses to run if the production indexer is up (would race on seeded NOT_STARTED rows).

Pre-commit verification

All hooks pass on the commit (black, reorder-python-imports, autoflake, ruff, prettier).

🤖 Generated with Claude Code

rajivml and others added 9 commits May 2, 2026 19:52
…, tests

Indexing scaling (NUM_INDEXING_WORKERS > 1 now safe):
- Per-cc-pair Postgres advisory lock prevents same-cc-pair concurrent runs
- Scheduler-side per-DocumentSource cap (INDEXING_PER_SOURCE_CAP, default 1)
  defers over-cap NOT_STARTED attempts in update.py rather than fail-fast
- Per-attempt indexing_priority column lets manual triggers jump the queue

DB retention (new daily Celery beat at 08:00 UTC, danswer/db/retention.py):
- 6 policies: kombu_message, task_queue_jobs, index_attempt (opt-in,
  keep-last-N), permission_sync_run, usage_reports, chat
- Batched DELETEs (5000/batch) under a single advisory lock + ANALYZE after
  large purges. Rollback-before-unlock pattern prevents the lock from getting
  stranded on a failed-transaction state.
- Chat retention is FK-safe: cleans search_doc orphans + file_store rows +
  Postgres LO blobs attached to chat_message.files. SELECT FOR UPDATE on
  session rows blocks new chat_message inserts during the batch transaction.
- One-shot CLI: scripts/cleanup_stale_db.py --dry-run / --policy=...

Analytics dashboard (community module, parallel to the EE one):
- New /admin/analytics page using Tremor (already in deps): KPI tiles +
  AreaCharts + BarList. Day/Month granularity toggle, date range picker,
  strict NPS computation (likes + resolved vs dislikes + needs_help).
- 6 backend endpoints under /api/analytics/admin/{query, user, danswerbot,
  total-docs, docs-per-source, slack-channels}.
- analytics_daily_rollup table + checkpoint-based daily Celery beat at
  07:30 UTC (30 min before retention sweep) so the dashboard survives chat
  retention deletes. Backfill CLI: scripts/backfill_analytics_rollup.py.

Slackbot resolved-button feedback (NEW DB write):
- handle_followup_resolved_button now records chat_feedback with
  predefined_feedback='resolved'. Threaded message_id through
  build_follow_up_resolved_blocks so the second resolved button has it too.
- Powers the strict-NPS calculation on the analytics dashboard.

Indexing-status admin page UI (improvements):
- Status filter dropdown (success / failed / in_progress / not_started /
  paused), search-by-connector-name with icon, bulk pause/re-enable actions
  in a dedicated row, pagination at 10/page, "Clear filters" button.

Connector improvements:
- Split monolithic Salesforce admin page into sf-account and sf-kbarticles
  with shared ConnectedApp credentials.
- New github-files connector for indexing repo file contents.
- Slack connector accepts channel IDs alongside channel names.
- Per-cc-pair credential edit forms across multiple connector pages.

DB migrations (4 new):
- 9d02a9a5ce39: indexing-status perf indexes (task_queue_jobs, index_attempt)
- fd307e9ecc9b: index_attempt.indexing_priority column + supporting index
- b5d3f1a9e7c2: chat UI perf indexes (chat_message.chat_session_id,
  chat_session.user_id) — CREATE INDEX CONCURRENTLY for online deploy
- c8a4e2f9d1b3: analytics_daily_rollup table

Test infrastructure (new):
- scripts/seed_test_data.py — auto-generated tag-isolated fixtures with
  configurable knobs (--days, --chats-per-day, feedback distribution,
  --with-old-data, --with-search-docs, etc.)
- scripts/test_analytics_e2e.py — analytics + chat retention orchestrator
- scripts/test_features_e2e.py — priority + index_attempt retention +
  permission_sync + resolved-feedback DB write
- scripts/test_celery_jobs_smoke.py — broker → worker plumbing check
  via .delay() against fresh dummy data

Documentation:
- AGENTS.md + CLAUDE.md (new) — agent operating notes for this fork
- TESTING.md (new) — test scripts + manual UI checklist
- CONTRIBUTING.md — sections on indexing scaling, per-source cap, retention
  env vars, analytics rollup, testing pipeline, all stress-test profiles

Kubernetes:
- env-configmap.yaml documents the new INDEXING_PER_SOURCE_CAP,
  RETENTION_DAYS_*, and ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS env vars
  (all blank = use code defaults).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uff E402 + prettier

CI pre-commit was failing on the prior commit (6cbdb06). Five hooks needed
satisfying:

- black: 20 files reformatted (line wrapping, trailing commas).
- reorder-python-imports: 9 files reordered (alphabetical within blocks).
- autoflake: removed unused imports / variables.
- ruff E402 (real bug, not just style): 6 late module-level imports in
  backend/danswer/db/index_attempt.py were trailing the new advisory-lock
  helpers (lines 89-94). Hoisted them to the top with the other imports —
  no actual circular dep, the late placement was incidental.
- prettier: 8 frontend files reformatted (trailing commas, line breaks).

Verified after the fixes:
- pre-commit run --from-ref feature/darwin --to-ref HEAD: all 5 hooks pass.
- mypy clean on the touched backend modules (retention, analytics_rollup,
  analytics, server/analytics/api, celery_app, update, run_indexing,
  index_attempt).
- Pre-existing tsc errors in sf-account/sf-kbarticles pages are unchanged
  (not introduced by formatters; will be fixed in a follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `deployment/kubernetes/analytics-bootstrap-job.yaml` — a Kubernetes
batch/v1 Job that runs ONCE in the `darwin` namespace and:

  1. `alembic upgrade heads` to apply the new analytics_daily_rollup
     migration (and any other pending revisions). Idempotent.
  2. `python scripts/backfill_analytics_rollup.py` to walk every
     historical date that still has chat data and populate the rollup
     table + checkpoint. Idempotent via INSERT…ON CONFLICT(date).

Apply ONCE, after the new backend image with the PR's code rolls out
to api-server / background-deployment, and BEFORE the next 08:00 UTC
retention sweep on a fresh DB. After this Job completes, the daily
Celery beat task at 07:30 UTC takes over.

Why a Job (not Deployment / Pod):
  - Deployment auto-restarts on container exit — wrong for one-time work.
  - Bare Pod doesn't track success / failure cleanly.
  - Job has run-to-completion + retry-on-failure (backoffLimit=3) +
    auto-cleanup (ttlSecondsAfterFinished=3600). Standard K8s pattern
    for one-shot maintenance.

Mirrors the api-server's image / env / volumes:
  - Same image (placeholder: vha-119 — bump to the post-merge tag).
  - POSTGRES_USER + POSTGRES_PASSWORD via danswer-secrets.
  - envFrom env-configmap (POSTGRES_HOST, encryption keys, etc.).
  - dynamic-pvc + file-connector-pvc volumes (defensive parity).

`activeDeadlineSeconds: 1800` caps the Job at 30 minutes — generous
even for very large chat histories. `restartPolicy: OnFailure` retries
within the same Pod before backoffLimit kicks in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CONTRIBUTING.md was carrying ~120 lines of testing how-to (orchestrator
fast paths, stress-test profiles, edge cases, knob reference, etc.) that
properly belong in TESTING.md. CONTRIBUTING.md is now setup-focused:
guidelines, local setup, env vars, formatting, release process — with a
short pointer to TESTING.md for the testing details.

TESTING.md gains five new sections that previously lived in CONTRIBUTING.md:
  - Stress-test profiles (Medium / Heavy / Massive single-line variants)
  - "What 'stress' actually exercises" caveat
  - Edge-case scenarios (all-positive, all-negative, slackbot-only)
  - Quick state check after seeding (psql one-liner)
  - Knob reference table

CONTRIBUTING.md shrank from 705 → 601 lines; TESTING.md grew from 260
→ 334 lines. Net ~30 lines deduped. No content lost — only relocated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why:
* Per-source cap leaked because `kickoff_indexing_jobs` only counted
  DB-IN_PROGRESS rows; attempts queued in Dask but not yet IN_PROGRESS
  slipped past the cap, letting 2+ same-source attempts run at once
  and causing the higher-priority queued attempt to lose to a lower-
  priority one that won the worker race.
* Connector deletion API had no in-flight dedup. Multiple "Delete
  connector" clicks queued parallel `cleanup_connector_credential_pair_task`
  invocations that each hit `SELECT ... FOR UPDATE NOWAIT` over the
  same documents, retried 10x30s, and all timed out.

What:
* `update.py`: extracted `_build_running_view` and
  `_evaluate_dispatch_for_attempt` pure helpers; the view now folds
  both DB-IN_PROGRESS and existing_jobs (filtered to non-terminal
  rows) into per-source / per-cc-pair accounting. Added scheduler-side
  per-cc-pair collision guard alongside the existing per-source cap.
* `run_indexing.py`: lock-contention path reverts attempts to
  NOT_STARTED instead of writing FAILED rows; rollback-before-unlock
  pattern in finally to handle aborted-transaction unlock failures.
* `connector_credential_pair.py`: added `try_acquire_deletion_lock` /
  `release_deletion_lock` (b"DELE" namespace, distinct from b"INDX").
* `celery_app.py`: `cleanup_connector_credential_pair_task` now
  acquires the deletion advisory lock at entry and releases via
  rollback-before-unlock in finally. Skips work + returns 0 on
  contention.
* `administrative.py`: `/admin/deletion-attempt` returns HTTP 409 if a
  cleanup task for this cc-pair is already live in `task_queue_jobs`.
* `web/.../status/page.tsx`: added Refresh button (with loading state
  driven by SWR `isValidating`) above the connectors table.

Tests:
* 22 unit tests in `tests/unit/danswer/background/test_indexing_scheduler.py`
  + `tests/unit/danswer/db/test_deletion_lock_keys.py`. Includes
  randomized fuzz (200 iterations) and a regression guard that
  simulates the pre-fix codepath to confirm test sensitivity.
* `scripts/test_scheduler_e2e.py`: 9 phases against live Postgres
  (priority sort, cap-leak fix, per-cc-pair guard, completed/FAILED
  in_flight not consuming cap, no double-counting, 20-iteration soak).
* `scripts/test_deletion_lock_e2e.py`: 6 phases against live Postgres
  (lock primitive cross-session, held-lock blocks task, lock release
  on success/exception, 6-thread race, API dedup logic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two real bugs that fail `next build`'s lint stage on a clean checkout
(severity:error, not just warnings):

* `chat/lib.tsx::useScrollonStream` was declared `async`, which
  silently broke the hook contract — React Hooks cannot be called
  inside an async function. The function never actually awaits
  anything; the `async` keyword was a copy-paste error. Removing it
  fixes 6 rules-of-hooks errors (4 useRef + 2 useEffect calls).
* `WelcomeModal.tsx::_WelcomeModal` violated the React component
  naming rule (must start with an uppercase letter). The leading
  underscore caused the lint hook-rules visitor to refuse to treat
  the function as a component, producing 6 rules-of-hooks errors
  (1 useRouter + 4 useState + 1 useEffect). Renamed to
  `WelcomeModalContent` (kept distinct from the wrapper's exported
  `WelcomeModal`) and updated the single import site.

Pre-existing react-hooks/exhaustive-deps and no-img-element warnings
are intentionally left untouched in this commit — they don't fail the
build and addressing them is out of scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These were latent in the codebase: yup `Yup.object().shape({...})`
was producing inferred shapes that didn't match the declared TS
interfaces, so `next build`'s type-check refused to compile after
the rules-of-hooks errors were cleared.

Three files, four schemas:

* `lib/types.ts` — `GithubConfig.repo_name: string` → `?: string`.
  The Github connector form's helper text already says "leave blank
  to index every repo the access token can see under this owner",
  the yup schema doesn't `.required()` it, and existing call sites
  use `(values.repo_name || "").trim()`. The TS type was the only
  thing claiming it was always present.
* `components/admin/connectors/ConnectorTitle.tsx` — the only
  consumer that read `repo_name` directly now renders just the
  owner when name is blank (instead of "owner/undefined").
* `admin/connectors/sf-account/page.tsx` (3 schemas) — yup schemas
  were missing the optional `sf_credential_kind` discriminator and
  the optional `requested_objects` config field, causing the
  inferred Shape<> to not match `SalesforceCredentialJson` /
  `SalesforceConfig`.
* `admin/connectors/sf-kbarticles/page.tsx` (2 schemas) — same
  `sf_credential_kind` fix as sf-account.

`tsc --noEmit` clean after these changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues that surfaced when applying the Job to AKS:

1. backend/Dockerfile — only `force_delete_connector_by_id.py` was
   being copied into `/app/scripts/`, so the bootstrap Job's
   `python scripts/backfill_analytics_rollup.py` failed with
   `[Errno 2] No such file or directory`. Add the backfill script
   to the runtime image. Safe — it's idempotent and admin-invoked.

2. deployment/kubernetes/analytics-bootstrap-job.yaml — `set -euo
   pipefail` failed with "Illegal option -o pipefail" because the
   image's `/bin/sh` is dash (Debian slim base), not bash. The
   script has no pipes anyway, so plain `set -eu` is sufficient
   and portable across every POSIX sh.

After this commit, build + push a new backend image tag and bump
the Job's `image:` to it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /admin/indexing/status page polled all cc-pairs every 10s, which
hurts environments with hundreds of (mostly disabled) connectors.
Three coordinated wins:

1. Backend: `/admin/connector/indexing-status` accepts a new optional
   `disabled` query param. `disabled=false` returns enabled connectors
   only, `disabled=true` returns disabled, omitted = all. Filter is
   applied after the existing bulk-fetch (the ~400-row query is cheap;
   the win is in the JSON response size and downstream rendering).

2. Frontend: page now defaults to "Enabled only" via a Show dropdown
   (Enabled / Disabled / All), passing the filter as part of the SWR
   cache key. Environments with mostly-paused historical connectors
   stop shipping them over the wire on every poll.

3. Frontend SWR options:
   - `refreshInterval`: 10s -> 30s. 10s was unnecessarily aggressive
     for an admin overview page.
   - `refreshWhenHidden: false`. Backgrounded admin tabs stop polling.
   - `revalidateOnFocus: true`. Stale view re-fetches when the tab
     regains focus.

4. Refresh button loading state is now bound to a manual-refresh flag
   instead of `isValidating`, so the spinner only spins when the user
   clicks Refresh - not on every background poll.

Also bumps the analytics-bootstrap Job image to vha-121 to align with
the upcoming backend rebuild that ships scripts/backfill_analytics_rollup.py
(per the previous commit's Dockerfile change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
